System and method for managing the performance of a computer system based on operational characteristics of the system components

ABSTRACT

A performance manager and method for managing the performance of a computer system based on a system model that includes measured entities representing the operational characteristics of the system components and relationships among the measured entities. The performance manager includes data producers for interacting with the interface agents of the components, an engine for exchanging information with the data producers and the system model, and an interaction model for determining relevant measured entities in the system model. The system model and interaction model are maintained in a repository where data might be accessed via an access interface. Incoming performance data is analyzed by an analyzer in the background to detect trends and relationships among the entities. An operator might review the relevant entities and apply controls to selected entities to manage the overall system performance as well as to resolve problems affecting the performance of the components in the system.

TECHNICAL FIELD

This invention relates to computer systems, and more particularly to a system and method for managing the performance of the computer system based on a system model that includes performance characteristics of the system components and their relationships.

BACKGROUND OF THE INVENTION

A performance problem in a system component usually appears to the system operator as a poor response time for an application running in the system. This is the case because the application typically depends on many resources in the system for its execution, including memory, storage switches, disk drives, networks, etc. For any one application there may be hundreds of different resources with the potential to cause performance problems by being unable to satisfy the demand. Over a whole system there may be many thousands of such interrelated entities.

Currently, programs may be set up to monitor the performance of the components separately. The results are gathered into a central tool for the operator's examination. A disadvantage of this approach is that it relies on the operator's understanding and experience as to how measurements and events from different components are related. With the scale of computer systems continuing to grow, it is very difficult for the operator to manage the performance of the systems and identify system problems accurately. Furthermore, the information for each component must generally be quantized into one of a few possible states.

U.S. Patent application No. 2002/0083371A1 describes a method for monitoring performance of a network which includes storing topology and logical relation information. The method attempts to help a user in determining the main causes of network problems. A drawback of this approach is that it limits dependencies between components to the physical and logical topology of the system, where the logical topology is a subset of the physical topology. In a networked storage system there might be performance dependencies between components which are not directly connected in the physical topology. Another drawback of this method is that although the user may “drill down” to the source of a problem by navigating through “bad” states of the components, the observed problem and the actual cause might be connected through a chain of entities that are themselves not in a bad state.

U.S. Pat. No. 6,393,386 describes a system for monitoring complex distributed systems. The monitoring system builds a dynamic model of the monitored system and uses changes in the state of the monitored components to determine the entities affected by a change. It correlates events which might have been caused by the same problem to identify the source of the problem. As such, the user of the system does not have the ability to investigate conditions which the system has not recognized as faulty or degraded conditions. In addition, the system searches for reasons for a particular degradation or failure of a node in the system. Since only the nodes directly connected to the affected node are considered, this approach might lead to incomplete analysis if the system model does not completely specify all relationships between the entities.

U.S. Pat. No. 5,528,516 describes an apparatus and method for correlating events and reporting problems in a system of managed components. The invention includes a process of creating a causality matrix relating observable symptoms to likely problems. This process reduces the causality matrix into a minimal codebook, monitors the observable symptoms, and identifies the problems by comparing the observable symptoms against the minimal codebook using various best-fit approaches. However, in a complex networked storage system, there might be several causes for a single observed problem, which requires different approaches to identify these causes. In such a situation, a solution implemented by a completely automated sub-system, as described in U.S. Pat. No. 5,528,516, might not be the ideal one for the user.

Therefore, there remains a need for a system and method for managing the performance of a computer system that help the operator effectively track the performance of individual components and accurately identify problems affecting the system performance.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a system and method for managing the performance of the computer system based on a model that includes the performance characteristics of the system components.

It is another object of the invention to provide a system model that maintains information about the parameters affecting the operation of the system components as measured entities and relationships among the entities.

It is still another object of the invention to provide an interaction model for maintaining information about an operator's interaction with the system model to help the operator identify and resolve a problem affecting the system performance.

It is a further object of the invention to provide an operator interface through which the operator might review the performance of the components and initiate changes to improve the system performance or to resolve performance problems.

To achieve these and other objects, the present invention provides a performance manager that includes a system model, one or more data producers, and an interaction model. The system model represents the state of the computer system based on the operational characteristics of the system components as supplied by the data producers. The system model includes measured entities that correspond to the components in the system and relationships that represent the effect one measured entity has on the operation of another entity. Each measured entity is associated with a set of metrics and controls for changing the metrics. The data producers communicate with the managed components through the interface agents associated with the components and provide the system model with the performance, configuration and diagnostic information about the components. The interaction model determines the most relevant entities affecting the performance, which are then presented to the operator through a visualizer. Through the interaction model, the visualizer also allows the operator to apply changes to the controls of selected measured entities.

The performance manager also includes an engine for exchanging information with the data producers and the interaction model. The interaction model and the system model are typically maintained in a data repository such as a relational database. Components of the performance manager access data in the repository through an access interface. In the preferred embodiment of the invention, the performance manager further comprises an analyzer for analyzing the performance data in the background and detecting trends affecting the performance of the system.

The invention also includes a method for managing the performance of a computer system using the performance manager. The method allows the operator to examine the relevant measured entities, their measurements, and associated metrics. Through the visualizer, the operator might apply controls to the selected entities to improve the system performance and identify the components that need attention in interaction sessions.

Additional objects and advantages of the present invention will be set forth in the description which follows, and in part will be obvious from the description and the accompanying drawing, or may be learned from the practice of this invention.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a schematic diagram showing the performance manager of the invention together with a typical managed computer system.

FIG. 2 is a block diagram showing the main components of the performance manager of the invention.

FIG. 3 is a block diagram showing a preferred embodiment of the system model in accordance with the invention.

FIG. 4 illustrates a typical relationship between an independent measured entity and a dependent measured entity.

FIG. 5 illustrates an example of the Forward Coupling Strength matrix and the Forward Confidence matrix that are derived from a relationship between a disk measured entity and an application measured entity.

FIG. 6 is a block diagram showing a preferred embodiment of the interaction model in accordance with the invention.

FIG. 7 is a flow chart showing a preferred process for computing an entity priority as part of the interaction model.

FIG. 8 illustrates a display by the visualizer to an operator as a result of an interaction session by the operator.

FIG. 9 is a flowchart showing a preferred process for diagnosing a system performance using the performance manager of the invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention will be described primarily as a system and method for managing the performance of a computer system based on a model that includes performance parameters of the components in the system. However, persons skilled in the art will recognize that an apparatus, such as a data processing system, including a CPU, memory, I/O, program storage, a connecting bus, and other appropriate components, could be programmed or otherwise designed to facilitate the practice of the method of the invention. Such a system would include appropriate program means for executing the operations of the invention.

Also, an article of manufacture, such as a pre-recorded disk or other similar computer program product, for use with a data processing system, could include a storage medium and program means recorded thereon for directing the data processing system to facilitate the practice of the method of the invention. Such apparatus and articles of manufacture also fall within the spirit and scope of the invention.

FIG. 1 is a block diagram showing a simple computer system that is managed by a performance manager in accordance with the invention. The computer system includes multiple storage client hosts 101 and a storage fabric 102 between the client hosts 101 and back-end storage devices such as a disk 103 and a RAID storage array 104. Each client host 101 might have one or more of processor 106, memory 107, application 108, file system 109, logical volume manager 110 and storage device drivers 111. Although FIG. 1 shows a simple computer system, a typical computer complex managed by the invention may include hundreds of client hosts 101 that are connected to hundreds of back-end storage devices 103-104 through multiple sub-networks as the storage fabric 102. The hosts 101 communicate with the storage fabric 102 through storage interfaces 112. The storage fabric 102 usually comprises several network switches and network devices like the switch 114. A performance manager 116 manages the components in the system through the storage fabric 102.

In accordance with the invention, some of the components in the managed system include a management interface through which external entities can extract performance information and operational state information from the components and can also control certain operational parameters of the components. Most system components like client hosts, servers, network switches/hubs and storage devices already have such management interfaces. An example of management interfaces that can be exposed by storage devices is described in “Bluefin—A Common Interface for SAN Management,” a White Paper by the Storage Networking Industry Association, Disk Resource Management Work Group (SNIA-DRM), August, 2002. If a particular component needs to be managed but does not include a management interface, then an agent like the Simple Network Management Protocol (SNMP) agent may be added to the component. It is well-known in the art how an SNMP agent can be implemented to provide the required interface functions. See, for example, “Essential SNMP,” by Douglas Mauro et al, October, 2001.

FIG. 1 shows an agent 105 for supporting the components 106-111 in a client host 101. An agent 113 supports the switch 114 of the storage fabric 102, while an agent 115 supports the storage array 104. The number of agents needed to manage a computer system and the decision as to which system components are supported by a particular agent depend on a particular system configuration and the desired implementation. Typically, one or more agents are set up to support a group of similar or related components. Using the operations described below in reference to FIGS. 2-8, the agents in the system, such as the agents 105, 113, and 115, provide the performance manager 116 with the performance characteristics of the managed components and allow their operational parameters to be controlled to improve the overall system performance. The performance manager 116 can interact with the agents over either the same storage fabric (in-band communication) or a separate multi-purpose network (out-of-band communication).

FIG. 2 is a block diagram showing the main components of the performance manager of the invention. The performance manager 20 includes one or more data producers 28, an engine 21, and a repository 24 that contains a system model 25 and an access interface 26. The access interface 26 includes an interaction model 27. The data producers 28 communicate with the agents associated with the managed components, typically in a one-to-one relationship. The repository 24 is where the performance data on the system components are recorded and retrieved reliably by the performance manager 20. The system model 25 contains the performance characteristics of the system components and their dependencies.

The repository 24 could be an off-the-shelf relational database management system with a database schema for supporting the system management functions. The repository 24 includes appropriate access methods for external components to access data in the repository via the access interface 26. The schema and access methods support the storage, maintenance, and retrieval of a system model 25 which represents the state of the computer system, in the form of data objects. The access interface 26 has a set of access methods and maintains information about the interaction model 27. The interaction model 27 keeps track of state information needed to allow the data accessors to interact with the system model 25.

The engine 21 includes functions for communicating with the data producers 28 and the access interface 26 of the repository 24. Through the engine 21 and the access interface 26, the data producers 28 provide performance information to the system model 25. The engine 21 is responsible for exchanging information with the data producers 28 either through a “pull” model, where the engine initiates the exchange, or a “push” model, where the data producer initiates the exchange. The engine 21 is also responsible for deciding which information needs to be made persistent in the repository 24 and how to evaluate the operational status of the managed components, in conjunction with the data producers 28.
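By way of illustration only, the “pull” and “push” exchanges between the engine and a data producer might be sketched as follows. The class and method names (`Engine`, `DataProducer`, `poll`, `record`, `update`) and the dictionary standing in for the repository are assumptions made for this sketch and do not appear in the specification.

```python
class DataProducer:
    """Hypothetical data producer holding the latest readings from its agent."""

    def __init__(self, readings):
        # readings: {(entity_name, metric_name): value}
        self.readings = dict(readings)
        self.on_data = None  # callback installed by the engine for "push"

    def read_all(self):
        # Used by the engine in the "pull" model.
        return [(e, m, v) for (e, m), v in self.readings.items()]

    def update(self, entity, metric, value):
        # A new reading arrives from the agent; in the "push" model the
        # producer initiates the exchange with the engine.
        self.readings[(entity, metric)] = value
        if self.on_data is not None:
            self.on_data(entity, metric, value)


class Engine:
    """Hypothetical engine; a plain dict stands in for the repository."""

    def __init__(self):
        self.repository = {}
        self.producers = []

    def register(self, producer):
        self.producers.append(producer)
        producer.on_data = self.record  # enable the "push" path

    def record(self, entity, metric, value):
        self.repository[(entity, metric)] = value

    def poll(self):
        # "Pull" model: the engine initiates the exchange.
        for producer in self.producers:
            for entity, metric, value in producer.read_all():
                self.record(entity, metric, value)


engine = Engine()
producer = DataProducer({("disk", "io_per_second"): 120.0})
engine.register(producer)
engine.poll()                                     # pull
producer.update("disk", "io_per_second", 150.0)   # push
```

In this sketch every reading is made persistent; a fuller version would place the engine's decision about which information to keep inside `record`.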

Each data producer 28 presents data to the repository 24, through the engine 21, where the data represent the current operational status of one or more components in the managed system. The data producers 28 are parts of the performance manager 20 that have the ability to communicate with the management interfaces or agents for the components in the managed system. Each data producer 28 provides the engine 21 with a list of the data objects, referred to as measured entities and described below in reference to FIGS. 3-5. The measured entities correspond to the components that the data producer is associated with. Each measured entity includes the operational characteristics, referred to as metrics, of the corresponding component. The system model also includes data objects called relationships which represent the effect that one measured entity has on the operation of another measured entity. The measured entities and their relationships are maintained in the system model. For example, for the computer system shown in FIG. 1, the system model might include one measured entity for each processor 106 and memory 107 in each host. It might also include several measured entities representing logical volumes, information about which comes from the volume manager 110 on the host 101.

In the preferred embodiment of the invention, the performance manager 20 further includes a visualizer 22 and one or more analyzers 23. The visualizer 22 provides a user interface through which human users could interact with the performance manager 20. It includes facilities which display the status of the measured entities and the controls that affect the entities. These facilities also allow the user to navigate through the displayed entities to diagnose and resolve a problem that affects the performance of the system.

The analyzer 23 is responsible for analyzing the incoming data and applying data mining techniques to discover trends as well as relationships that exist between measurements of the different entities. The data analysis is preferably implemented as a background task. An example data mining process is described in “Data Mining: Exploiting the Hidden Trends in Your Data,” by Herb Edelstein, DB2online Magazine, Spring 1997. The analyzer 23 is also responsible for making changes to the system model 25, as described below, when the management system is used to apply controls to the entities in the managed system.

FIG. 3 illustrates the key components of a system model 30. The system model 30 contains two types of information: one or more measured entities 31 and one or more entity relationships 32. Each measured entity 31 corresponds to a facility or operation in one or more managed system components. Associated with each measured entity 31 are one or more metrics 33. Each metric 33 is associated with a measurement of its measured entity 31 that has the potential to change over time. When prompted, a data producer (item 28 of FIG. 2) provides a list of all metrics 33 associated with a given measured entity 31. It also provides the current value and possibly recent values of any given metric. Each metric 33 is additionally associated with information that identifies and describes the metric 33 to the operator and to the analyzer (item 23 in FIG. 2). This information includes the identity of the data producer which is responsible for supplying data about the measured entity 31, including the values of the metrics 33. The descriptive data may include reference values that represent expected values, upper or lower limits, or thresholds for consequences. The descriptive data may also include classification by category.

Associated with each measured entity 31, there might be one or more controls 34 which are subject to changes by the system operator or by the performance manager. Each data producer provides to the engine (item 21 in FIG. 2) a list of all controls 34 associated with a given measured entity 31. For each control 34, this data producer provides the engine with information that determines how the control 34 is to be presented in the user interface to permit the system operator or the performance manager to make changes or take actions. This may, for example, be represented as a Java Bean. The parameters subject to change by a control 34 may themselves be metrics 33 of the same measured entity 31.

An entity relationship 32 specifies knowledge in the system model about how two measured entities 31 are related. An entity relationship 32 can be used, for example, to represent the correlation between the performance of a database application and a disk on which the data for the application is stored. These relationships are exploited by the system and method of the invention to improve the performance of the managed system. Associated with the entity relationship 32 are a Forward Coupling Strength matrix 35 and a Forward Confidence matrix 36. These matrices will be described in reference to FIG. 5.
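By way of illustration only, the data objects of the system model might be laid out as in the following sketch. All class names, field names, and sample values are assumptions made for this sketch; the specification does not prescribe a schema, and the coupling matrices are omitted here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Metric:
    name: str
    producer_id: str                   # data producer supplying the values
    value: Optional[float] = None      # most recent sample, if any
    reference: Optional[float] = None  # expected value, limit, or threshold
    category: Optional[str] = None     # classification by category


@dataclass
class Control:
    name: str
    setting: Optional[float] = None


@dataclass
class MeasuredEntity:
    name: str
    metrics: Dict[str, Metric] = field(default_factory=dict)
    controls: Dict[str, Control] = field(default_factory=dict)


@dataclass
class EntityRelationship:
    independent: str  # name of the independent measured entity
    dependent: str    # name of the dependent measured entity


@dataclass
class SystemModel:
    entities: Dict[str, MeasuredEntity] = field(default_factory=dict)
    relationships: Dict[Tuple[str, str], EntityRelationship] = field(
        default_factory=dict
    )

    def add_entity(self, entity: MeasuredEntity) -> None:
        self.entities[entity.name] = entity

    def relate(self, independent: str, dependent: str) -> EntityRelationship:
        # Relationships are keyed by the ordered pair of entity names, so
        # relating the same pair twice returns the existing object.
        for name in (independent, dependent):
            if name not in self.entities:
                raise KeyError(name)
        key = (independent, dependent)
        if key not in self.relationships:
            self.relationships[key] = EntityRelationship(independent, dependent)
        return self.relationships[key]


# Hypothetical entities for a disk and a database application.
model = SystemModel()
disk = MeasuredEntity("disk")
disk.metrics["io_per_second"] = Metric("io_per_second", "storage-producer")
app = MeasuredEntity("application")
app.metrics["transactions_per_second"] = Metric(
    "transactions_per_second", "host-producer"
)
app.controls["thread_count"] = Control("thread_count", setting=8)
model.add_entity(disk)
model.add_entity(app)
relationship = model.relate("disk", "application")
```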

Referring again to FIG. 2, the performance manager 20 collects, via its data producers 28, and records in its repository 24 the identifying and descriptive information of each data producer 28, measured entity 31, and metric 33. It also collects periodic samples of the values of metrics and records them in the repository 24 for future use. A policy-based mechanism might be used to determine how long the knowledge of the metric values must be preserved.

In addition to the measured entities 31 and metrics 33, the performance manager 20 also collects and maintains in the system model 25 data objects that represent the entity relationships 32. The entity relationships 32 are created and changed in several ways as described later in the specification. In the simplest form, the data producers 28 supply data for defining these relationships. The engine 21 provides to the data producers 28 information that identifies and describes all of the measured entities 31 and metrics 33, so that a data producer 28 may create or modify relationship objects for measured entities 31 that are serviced by other data producers 28.

FIG. 4 illustrates how two measured entities are related to each other by a relationship 45. The relationship 45 is represented as a large arrow originating from a measured entity 40, referred to as an independent entity, and pointing to a measured entity 41, referred to as a dependent entity. The direction of the relationship is said to be from the independent to the dependent. There is at most one relationship for a given ordered pair of entities. So, it is possible that two entities are in two relationships, one in which the first entity is independent and another in which the second entity is independent. The entity relationship contains information, referred to as dependencies, about how the metrics of the independent measured entity affect the metrics of the dependent measured entity. FIG. 4 shows three dependencies 46, 47 and 48 between the independent entity 40 and the dependent entity 41. For example, the dependency 46 is between metric 42 of entity 40 and metric 43 of entity 41.

The relationship 45 also contains data from which several coupling matrices can be derived, each of which contains one value for each combination of a metric of the independent entity with a metric of the dependent entity. The basic form of these matrices has a number of rows equal to the number of metrics of the independent measured entity and a number of columns equal to the number of metrics of the dependent measured entity. An entry M(x,y), which is in the x-th row and y-th column of a matrix M, relates to the x-th metric of the independent measured entity and y-th metric of the dependent measured entity. There are two basic coupling matrices, from which other matrices can be derived:

(1) The Forward Coupling Strength (FCS) matrix: the value of an entry M(x,y) in the Forward Coupling Strength matrix is a value between −1.0 and 1.0. It indicates the relative magnitude and direction (i.e., positive means increasing and negative means decreasing) of a change in the y-th metric of the dependent entity that is caused by a change in the x-th metric of the independent entity.

(2) The Forward Confidence (FC) matrix: the value of the entry M(x,y) in the FC matrix is a value between 0.0 and 1.0 that indicates the likelihood that the change described by the forward coupling strength will apply. A value of 1.0 means that the relationship is always expected to be present and have an effect. This could be the case, for example, if there is a direct functional relationship between the two metrics. A value of 0.0 means that there is no knowledge of a relationship, and in this case the value of the Forward Coupling Strength matrix should be ignored.

It is possible for the independent (or parent) entity and the dependent (or child) entity in a relationship to be the same entity. This is used to represent the fact that metrics of a measured entity are dependent on other metrics of the same entity. The coupling matrices could be stored in the form of a list enumerating all entries. However, such an implementation is usually avoided because of the potentially large amount of data involved. The Forward Coupling Strength and Forward Confidence matrices are considered to be sparse matrices in which most entries are assumed to have the same default value, which could be but is not necessarily zero. Entries whose values are non-default are listed, in the form of data objects called dependencies 46. Each dependency specifies the identity of a cause metric, the identity of an effect metric, and values for strength and confidence. If the system model contains no relationship object for a given pair of entities, there is an implicit default relationship for which the values of all matrix elements are zero, and the system behavior is as if a relationship with these values were contained in the system model. The concept of dependencies can easily be extended to include controls. For example, for a measured entity representing an application, the metric measuring the transactions per second could be the effect metric in a dependency where the control for the number of threads used by the application is the cause control.

An example of two related measured entities is shown in FIG. 5. The metrics of the application measured entity 51 depend on the metrics of the disk measured entity 50. These dependencies are represented by a relationship 53, which includes two example dependencies 54 and 55. The dependency 54 represents that the “transactions per second” metric 59 of the application measured entity 51 is related to the “I/O operations per second” metric 58 of the disk measured entity 50. In the dependency 55, the “average response time” metric 57 for transactions delivered by the application is related to the “average response time” metric 56 for I/O operations delivered by the disk. FIG. 5 also shows a Forward Coupling Strength matrix 52 and a Forward Confidence matrix 521 as an example. There is a row in the matrices 52 and 521 for each metric of the disk measured entity 50 and a column for each metric of the application measured entity 51. As an example, the entries in the second row and second column of the matrices 52 and 521 correspond to the dependency 55, which states that the average response time per transaction of the application is dependent on the average response time per I/O of the disk. This dependency has a strength of +0.5 (from the Forward Coupling Strength matrix 52), and the management system has 90% confidence that this dependency exists (shown as 0.9 in the Forward Confidence matrix 521).
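By way of illustration only, the sparse storage of the two coupling matrices as a list of dependencies might be sketched as follows, using the strength (+0.5) and confidence (0.9) stated for the dependency 55. The class names, metric identifiers, and the range validation are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Dependency:
    cause: str         # metric of the independent entity (matrix row)
    effect: str        # metric of the dependent entity (matrix column)
    strength: float    # Forward Coupling Strength entry, -1.0 to 1.0
    confidence: float  # Forward Confidence entry, 0.0 to 1.0


class Relationship:
    """Sparse view of the FCS and FC matrices of one relationship."""

    def __init__(self, independent, dependent,
                 default_strength=0.0, default_confidence=0.0):
        self.independent = independent
        self.dependent = dependent
        self.default_strength = default_strength
        self.default_confidence = default_confidence
        self._deps: Dict[Tuple[str, str], Dependency] = {}

    def add(self, dep: Dependency) -> None:
        if not -1.0 <= dep.strength <= 1.0:
            raise ValueError("strength must lie between -1.0 and 1.0")
        if not 0.0 <= dep.confidence <= 1.0:
            raise ValueError("confidence must lie between 0.0 and 1.0")
        self._deps[(dep.cause, dep.effect)] = dep

    def strength(self, cause: str, effect: str) -> float:
        # Entries that are not listed take the default value.
        dep = self._deps.get((cause, effect))
        return dep.strength if dep is not None else self.default_strength

    def confidence(self, cause: str, effect: str) -> float:
        dep = self._deps.get((cause, effect))
        return dep.confidence if dep is not None else self.default_confidence


# Dependency 55: disk response time affects application response time.
rel = Relationship("disk", "application")
rel.add(Dependency("disk.avg_response_time", "app.avg_response_time",
                   strength=0.5, confidence=0.9))
```

Only non-default entries occupy storage, so a relationship between entities with many metrics stays small when few metric pairs are coupled.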

The relationships in the system model, such as the relationship 45 of FIG. 4, are created and modified via several paths:

1) By the data producer: an example of this case is where the data producer for a storage system defines the measured entities for the logical disks that are visible to the storage clients. The data producer would provide the relationships between the logical disks and the physical disk ranks that show that the load on the physical ranks is dependent on the load to the logical disks, and that the performance of the logical disks is dependent on the performance of the physical disk ranks.

2) Manually entered: for example, the operator selects two measuredentities and designates them as related, by manual input.

3) Rules driven: a set of rules that define patterns for relationships is provided to the control program. The database is scanned for pairs of entities that match these patterns. These patterns might be expressed as SQL queries to the relational database of entities and metrics.

4) Action inferred: when the user takes actions such as the application of controls in response to a particular performance shortfall, the performance manager can assume a tentative causal link between entities directly related to the shortfall and the entity to which the control is applied.

5) Indirect (transitively inferred): when there are two relationships, one between an entity x as the independent and an entity y as the dependent, and another between the entity y as the independent and an entity z as the dependent, then a relationship may be automatically generated with x as the independent and z as the dependent. This is done as a background activity in the system model. A pre-filtering step selects only those relationships with high enough values in the Forward Coupling Strength and Forward Confidence matrices, and pairs are selected for defining a candidate relationship. The strength and confidence matrices of the candidate relationship are computed as the matrix products of the corresponding matrices of the original two relationships, and a threshold-based criterion is used to determine whether the candidate is accepted into the system model.

6) Statistically adjusted: statistical correlation of metric values is used to adjust the values in the Forward Coupling Strength matrix and, in some cases, the Forward Confidence matrix as well. In the preferred embodiment, the dependencies contained within relationships are of different types that indicate the source and the bounds of possible confidence, and these type values determine when and how statistical correlation can be used to adjust the numerical values.

7) Navigation inferred: weak dependencies are inferred from the fact that the operator has chosen to navigate a path from a specified starting point to a given focus entity in a session, even if no action is taken.

8) Statistically inferred: data mining approaches are used to detect correlations between entities for which relationships are not yet known.
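By way of illustration only, the transitive inference of path 5 might be sketched as follows. The matrices are plain nested lists with rows for the metrics of the independent entity and columns for the metrics of the dependent entity. The function names, the threshold values, and the particular acceptance rule (at least one entry that is both strong enough and confident enough) are assumptions made for this sketch, since the specification states only that a threshold-based criterion is used; the pre-filtering step is omitted.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b)))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def infer_transitive(fcs_xy, fc_xy, fcs_yz, fc_yz,
                     min_strength=0.2, min_confidence=0.5):
    """Return (FCS, FC) for a candidate x -> z relationship, or None.

    The thresholds are hypothetical values chosen for the example.
    """
    fcs_xz = matmul(fcs_xy, fcs_yz)
    fc_xz = matmul(fc_xy, fc_yz)
    accepted = any(
        abs(s) >= min_strength and c >= min_confidence
        for s_row, c_row in zip(fcs_xz, fc_xz)
        for s, c in zip(s_row, c_row)
    )
    return (fcs_xz, fc_xz) if accepted else None


# x has one metric, y has two metrics, z has one metric.
fcs_xy, fc_xy = [[0.5, 0.0]], [[0.9, 0.0]]
fcs_yz, fc_yz = [[0.8], [0.0]], [[1.0], [0.0]]
candidate = infer_transitive(fcs_xy, fc_xy, fcs_yz, fc_yz)
```

A product of several non-zero terms can fall outside the ranges defined for the two matrices; this sketch does not clamp such values, and the specification does not say how they would be handled.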

In accordance with the invention, a set of criteria for evaluating measured entities is associated with the system model. The different criteria are referred to as temperature scales. Given a temperature scale, at a point in time, the system model defines a numeric value for each entity denoted as its Absolute Entity Temperature (AET) measured with respect to that scale. It likewise defines a list of numeric values, one for each metric and control of the entity, denoted as the Absolute Temperature Vector (ATV) measured with respect to that temperature scale. Each value is a nonnegative real number.

Each temperature scale denotes a context of problem solving or performance analysis, and the temperature measured with respect to it represents the likely importance or benefit of examining or acting on the entity, metric or control in question within that context. For example, the identification of recent operational failures, the identification of current performance shortfalls, and the prediction of future problems are three contexts each of which may be represented by a separate temperature scale. In general, the temperature (for a given entity and a given temperature scale) is determined from the current and historical values of its metrics. For example, in the context of current performance shortfalls, measured entities that are failing to meet requirements are assigned a high temperature, as are the metrics most nearly responsible for the failures (i.e., those for which an improvement would be most likely to eliminate the measured entity's failure). Measured entities and metrics that are safely out of the failure state would be assigned low temperatures, and those which may be considered marginal are assigned intermediate values.

Absolute temperatures, and context-specific temperatures derived from them, are used in the visualizer and the interaction model to determine the interaction priorities that guide the operator. This use is described below as part of the interaction model 27 of FIG. 2. The interaction model creates thermometers for the different temperature scales being used in the system. The thermometer is simply an object which applies the criteria of its temperature scale to an input metric or entity and is able to output the current temperature with respect to its temperature scale. Note that the term “temperature” is used to match the notion of “hot spot” as a component needing attention or correction.

The performance manager contains a temperature scale list that includes one or more generic scales representing general analysis contexts, like the three example contexts provided above. Several thermometers based on the generic scales for computing temperatures might be provided, such as:

1) Threshold thermometer: this thermometer designates a set of metrics, and for each metric in the set, the thermometer designates a reference value, a weight, and a direction selector. The direction selector might have the value of −1, 0, or +1. For each metric, the thermometer obtains a difference by subtracting the current measured value from the reference value. If the direction selector is zero or if the difference has the same sign (positive or negative) as the direction selector, then the difference is retained as calculated; otherwise the difference is set to zero. A score function is defined as the weighted sum of squares of these differences. The threshold thermometer is used to examine current failures or potential failures with respect to performance targets, by setting reference values at or near the target limits and using +1 or −1 as the direction selectors. It is also used to detect and highlight deviations from normal by the use of zero as the direction selectors.

2) Future trend thermometer: this thermometer designates a duration of concern, a set of metrics and, for each metric in the set, a reference value, a weight, and a direction selector. A score function is defined as for a threshold thermometer with the same parameters, except that the calculation uses a prediction of the metric value in the future, extrapolated by the duration of concern, in place of current measured values.

3) Threshold-crossing thermometer: this thermometer indicates that a metric has exceeded a threshold. The metric temperature is set to a constant for each metric that has crossed the threshold, and to zero for all other metrics.

4) Dynamically generated thermometer: when there is a performance shortfall, the system model can create a dynamically generated thermometer for the situation. For example, if two metrics of a measured entity are both above a threshold (which causes the performance manager to generate a thermometer), then the temperature values for those metrics (in the ATV created with the use of this thermometer) will be high, and the temperature values for other metrics will be low.
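The score functions of the threshold and future trend thermometers described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `MetricRule`, `threshold_score`, `extrapolate` and `future_trend_score` are hypothetical, and the straight-line extrapolation is an assumption, since the text does not say how the future value is predicted.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MetricRule:
    """Per-metric parameters of a threshold thermometer (hypothetical name)."""
    reference: float  # reference value for the metric
    weight: float     # weight applied to the squared difference
    direction: int    # direction selector: -1, 0, or +1


def retained_difference(rule: MetricRule, value: float) -> float:
    """Reference minus measured value, kept only if the selector allows it."""
    diff = rule.reference - value
    if rule.direction == 0:
        return diff  # selector 0: deviations in either direction count
    same_sign = (diff > 0 and rule.direction > 0) or (diff < 0 and rule.direction < 0)
    return diff if same_sign else 0.0


def threshold_score(rules: Dict[str, MetricRule], values: Dict[str, float]) -> float:
    """Weighted sum of squares of the retained differences."""
    return sum(rule.weight * retained_difference(rule, values[name]) ** 2
               for name, rule in rules.items())


def extrapolate(previous: float, current: float, elapsed: float, duration: float) -> float:
    """ASSUMPTION: straight-line extrapolation from two samples taken
    `elapsed` time units apart, projected `duration` units ahead."""
    return current + (current - previous) / elapsed * duration


def future_trend_score(rules: Dict[str, MetricRule],
                       previous: Dict[str, float],
                       current: Dict[str, float],
                       elapsed: float,
                       duration: float) -> float:
    """Same score as the threshold thermometer, applied to predicted values."""
    predicted = {name: extrapolate(previous[name], current[name], elapsed, duration)
                 for name in rules}
    return threshold_score(rules, predicted)


# Hypothetical usage: a response time target of 2 ms, flagged only when the
# measured value exceeds the reference (difference negative, selector -1).
rules = {"response_ms": MetricRule(reference=2.0, weight=1.0, direction=-1)}
current_score = threshold_score(rules, {"response_ms": 3.0})
trend_score = future_trend_score(rules, {"response_ms": 1.0}, {"response_ms": 1.5},
                                 elapsed=10.0, duration=20.0)
```

With a selector of −1 the metric contributes only when the measured value is above the reference; a selector of 0 would score a deviation in either direction, which corresponds to the deviation-from-normal use described for the threshold thermometer.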

As an illustration of the use of thermometers to generate AETs, consider the following example. The performance manager might notice that the disk measured entity 50 shown in FIG. 5 has a performance shortfall in the response time metric. Based on this situation the performance manager creates a dynamically generated thermometer which generates an ATV using the formula:

metric temperature = (current metric value) / (metric threshold) × 100

and computes the AET as:

AET = current metric temperature of the average response time metric

Thus, if the average response time target is 2 ms and the current measured time is 3 ms, then the metric temperature for the average response time metric and the AET for the disk entity will be (3)/(2) × 100, that is 150. The computation of absolute temperature may be done within the engine in reference to data contained in the system model, or may involve reference to the measured entity's data producer.
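The arithmetic of this example can be sketched in a few lines. The function and metric names below are hypothetical, and the rejection of non-positive thresholds is an added assumption; the text only gives the ratio formula.

```python
from typing import Dict, Tuple


def metric_temperature(current: float, threshold: float) -> float:
    """(current metric value) / (metric threshold) x 100, per the formula above."""
    if threshold <= 0:
        # ASSUMPTION: the text does not say how a non-positive threshold is handled.
        raise ValueError("threshold must be positive")
    return current / threshold * 100.0


def dynamic_thermometer(metrics: Dict[str, float],
                        thresholds: Dict[str, float],
                        aet_metric: str) -> Tuple[Dict[str, float], float]:
    """Return (ATV as a name -> temperature mapping, AET).

    The AET is taken to be the temperature of one designated metric,
    as in the disk example (the average response time metric).
    """
    atv = {name: metric_temperature(value, thresholds[name])
           for name, value in metrics.items()}
    return atv, atv[aet_metric]


# Values from the example: 2 ms target, 3 ms currently measured.
atv, aet = dynamic_thermometer({"avg_response_ms": 3.0},
                               {"avg_response_ms": 2.0},
                               aet_metric="avg_response_ms")
print(aet)  # 150.0
```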

FIG. 6 illustrates a preferred embodiment of an interaction session 60. The interaction session 60 represents the state and progress of the interaction between the operator or other agent on the one hand and the system model on the other. The selection and display of data and the changes to the controls occur in association with a session. The session 60 is typically created when the operator starts a particular task (such as fixing a particular problem) and it is deleted when the task is completed, and the actions done to perform the task are usually associated with that session. The association between operator actions and session state happens through the visualizer (item 22 in FIG. 2).

The session 60 combines information about the problem being solved with information about the managed system so as to guide the operator's decisions. Each session 60 contains a thermometer 61, an event identifier 62, a measured entity set 63, a focus entity 64, a Context Temperature Vector (CTV) 65, an entity priority function 66, and a history 67. The thermometer 61 identifies the temperature scale being used by the session. The event identifier 62 either selects a particular event contained in the system model or contains a value that designates that there is no association with an event. The measured entity set 63 identifies a set of the measured entities considered to be relevant to the operational characteristics of the focus measured entity. The focus entity 64 identifies one measured entity which is the current focus of the task, or it may indicate that no entity is the focus entity. The CTV 65 specifies a nonnegative numerical value for each metric and control of the focus entity that represents the likely importance or benefit of changing that metric or control in the context defined by the session. The entity priority function 66 defines numerically the relative priority of each entity in the measured entity set 63 for the next interaction step. The entity priority function 66 may be a list of numeric values, one for each entity in the measured entity set 63. The history 67 includes the sequence of interactions that have occurred in this session. It preferably consists of a list of the focus entities 68 that have been selected, their associated data and the corresponding CTV 65 in the order the entities were focused on.

The computation of the entity priority function is shown in FIG. 7. The entity priority function is defined in one of two ways, depending on whether a focus entity (FE) and CTV are defined for the session (step 70). If either the focus entity or the CTV is not defined for the session, then the entity priority function is computed as the Absolute Entity Temperature (AET) of the given entity measured with the thermometer of the session, in step 71. If both the focus entity and the CTV are defined, then the value of the entity priority function for a given entity is defined as the matrix product of:

a weight matrix computed from the coupling between entities (where the focus entity is the dependent entity),

the transpose of an adjusted version of the CTV of the session, and

an adjusted version of the ATV of the given entity.

Specifically, the process for computing the entity priority function includes the following operations:

1) ACT computation (step 72): the Adjusted Context Temperature (ACT) is computed from the CTV so that every entry in the ACT has a value at least equal to a parameter called γ, set in the configuration of the control program. The value of γ is typically about 5% of the maximum value of metric temperature recently computed. In the preferred embodiment of the invention, the value of ACT(i) is computed as:

ACT(i) = CTV(i) + γ

2) CWM computation (step 73): the coupling weight matrix or CWM for a given entity is computed from the relationship, if any, for which the given entity is the independent and the focus entity is the dependent. If such a relationship is present in the system model, then the CWM is computed as a combination of the Forward Coupling Strength matrix and the Forward Confidence matrix, element by element. In the preferred embodiment of the invention, this is accomplished by combining the matrices element by corresponding element, via the function:

CWM(mA, mF) = (ε + abs(Strength(mA, mF))) × Confidence(mA, mF)

where mF and mA are the identifiers for the metrics of the focus entity and the other entity, respectively, ε is a constant adjusted in the configuration of the control program, and abs( ) is the absolute value function. A representative value for ε is 0.1.

3) TSV computation (step 74): the Temperature Sensitivity Vector (TSV) for a given entity is computed by multiplying the CWM for the given entity with the transpose of the ACT, by the usual rules of matrix arithmetic. The TSV has a value for each metric or control of the given entity.

4) AAT computation (step 75): the Adjusted Absolute Temperature (AAT) for a given entity is computed from the ATV of the given entity measured with the thermometer of the session, so that every entry in the AAT has a value at least equal to a parameter called η, set in the configuration of the control program. The value of η is typically about 5% of the maximum value of metric temperature recently computed. In the preferred embodiment of the invention, the value of AAT(i) is calculated as:

AAT(i) = ATV(i) + η

5) Priority function computation (step 76): the priority function value for a given entity is computed as the scalar product of the TSV for the given entity with the AAT for the given entity.
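Taken together, steps 72 through 76 amount to a single double sum. The following restatement is only a compact summary of the operations listed above; the symbol P and the index names a (metrics and controls of the given entity) and f (metrics and controls of the focus entity) are notation introduced here for illustration.

```latex
\begin{aligned}
\mathrm{ACT}(f)   &= \mathrm{CTV}(f) + \gamma \\
\mathrm{CWM}(a,f) &= \bigl(\varepsilon + \lvert \mathrm{Strength}(a,f) \rvert\bigr)\,\mathrm{Confidence}(a,f) \\
\mathrm{TSV}(a)   &= \sum_{f} \mathrm{CWM}(a,f)\,\mathrm{ACT}(f) \\
\mathrm{AAT}(a)   &= \mathrm{ATV}(a) + \eta \\
P                 &= \sum_{a} \mathrm{TSV}(a)\,\mathrm{AAT}(a)
                   = \sum_{a}\sum_{f} \mathrm{AAT}(a)\,\mathrm{CWM}(a,f)\,\mathrm{ACT}(f)
\end{aligned}
```

Written this way, the priority value can be accumulated term by term without materializing the intermediate vectors and matrices.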

As an illustration, consider the relationship shown in FIG. 5. Assume that the ATV for the disk measured entity is (60, 150), the ATV for the application measured entity is (40, 125, 65), and that the application measured entity is the first entity selected in an interaction session. The application measured entity becomes the focus entity and the CTV is set to be the ATV of the application measured entity. The computation of the entity priority value for the disk measured entity proceeds as follows:

1) ACT Computation (Step 72):

Assume that the value of γ is 10.

ACT=[40+10, 125+10, 65+10]=[50, 135, 75]

2) CWM Computation (Step 73):

Assume that the value of ε is 0.1.

CWM(1, 1) = (0.1 + abs(0.8)) × 0.9 = 0.81

CWM(2, 2) = (0.1 + abs(0.5)) × 0.9 = 0.54

CWM(1, 2) = CWM(1, 3) = CWM(2, 1) = CWM(2, 3) = (0.1 + abs(0.0)) × 0.0 = 0.0

Thus, the CWM is the following matrix, with one row per metric of the disk measured entity and one column per metric of the application measured entity:

[0.81 0.0 0.0]
[0.0 0.54 0.0]

3) TSV Computation (Step 74):

TSV = CWM × transpose of [50, 135, 75] = [0.81 × 50, 0.54 × 135] = [40.5, 72.9]

4) AAT Computation (Step 75):

Suppose the value of η is 10, then

AAT=[60+10, 150+10]=[70, 160]

5) Priority Function Computation (Step 76):

The scalar product of [40.5, 72.9] and [70, 160] is ((40.5 × 70) + (72.9 × 160)) = 14499

The primary part of the entity priority value is the term (72.9 × 160), which is based on the temperatures of the average response time metrics of both the disk measured entity and the application measured entity. This is desired because in the example, the application response time exceeds its threshold value and the entities affecting this response time should have a higher priority when an operator looks for the source of the problem.
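The worked example above can be reproduced with a short sketch of steps 72 through 76. The function names are hypothetical, matrices are plain nested lists with one row per metric of the given (disk) entity and one column per metric of the focus (application) entity, and the strength and confidence values are those used in the example (0.8 and 0.5 with confidence 0.9, zero elsewhere).

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def adjusted(vector: Vector, offset: float) -> Vector:
    """Steps 72 and 75: add gamma (for the CTV) or eta (for the ATV) to every entry."""
    return [value + offset for value in vector]


def coupling_weight_matrix(strength: Matrix, confidence: Matrix, epsilon: float) -> Matrix:
    """Step 73: CWM(mA, mF) = (epsilon + abs(Strength)) * Confidence, element by element."""
    return [[(epsilon + abs(s)) * c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(strength, confidence)]


def temperature_sensitivity(cwm: Matrix, act: Vector) -> Vector:
    """Step 74: TSV = CWM x transpose(ACT)."""
    return [sum(w * t for w, t in zip(row, act)) for row in cwm]


def entity_priority(ctv: Vector, atv: Vector, strength: Matrix, confidence: Matrix,
                    gamma: float, epsilon: float, eta: float) -> float:
    """Step 76: scalar product of the TSV with the AAT."""
    act = adjusted(ctv, gamma)
    tsv = temperature_sensitivity(coupling_weight_matrix(strength, confidence, epsilon), act)
    aat = adjusted(atv, eta)
    return sum(s * a for s, a in zip(tsv, aat))


# Inputs from the example: application is the focus entity, disk is the given entity.
strength = [[0.8, 0.0, 0.0],
            [0.0, 0.5, 0.0]]
confidence = [[0.9, 0.0, 0.0],
              [0.0, 0.9, 0.0]]
priority = entity_priority(ctv=[40.0, 125.0, 65.0], atv=[60.0, 150.0],
                           strength=strength, confidence=confidence,
                           gamma=10.0, epsilon=0.1, eta=10.0)
print(round(priority, 6))  # 14499.0
```

Because the confidence entries are zero wherever no dependency is known, the ε term contributes nothing for those metric pairs in this example; it only raises the weight of pairs that already have a nonzero confidence.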

In the preferred embodiment of the invention, the intermediate vectors and matrices are not computed explicitly but their contributions to the priority function value are computed and aggregated. The interaction model may also contain control data such as security information for identifying individual operators and the data to which they have access. In addition, each session (item 60 in FIG. 6) contained in the interaction model is created and used via the visualizer. A display is presented to the operator of the entities in the measured entity set of the session. The rendering of the display directs the operator's attention primarily to the focus entity (if one is defined) and to those entities that have the highest value of the entity priority function. It also permits the operator to select one of these entities as the next focus entity (item 64 in FIG. 6), or to change the other objects contained in the session. The display may provide additional information on the entities that it presents, such as the current values of their metrics, and options to allow the operator to act on the controls of the entities.

FIG. 8 shows an example of the display to the operator. The display is preferably associated with an interaction session and is displayed as a window on the operator's console. It includes a table 80 in which each row corresponds to a measured entity, with the entities having the highest scores at the top of the table 80. The operator may select one of the entities as the next focus entity using a pointing device like a mouse. Other cues like color, icon size and apparent distance, clarity, and sounds might also be used to direct the operator's attention. A second table 81 presents additional data on a particular entity, which need not be the focus entity, and includes a facility to view metrics and to select and act on the controls of this particular entity. The display further has a table 82 which shows the history of the operator's navigation and allows the operator to back-track to any of the previous navigation steps.

The visualizer allows an operator to use an interaction session to navigate to the source of a performance problem while the interaction model session (item 60 in FIG. 6) keeps track of this navigation. The visualizer provides the following main functions for navigation with stated consequences:

FESELECT (f): Select entity (f) as the new focus entity (item 64 in FIG. 6), a forward step.

CONTROL (e, c, a, v): Apply action (a) to control (c) in entity (e) with an optional association to event (v).

The FESELECT(f) navigation action selects entity (f) as the new focus entity (item 64 in FIG. 6) of the interaction session. The user might effect this action by selecting an entity represented in table 80 using a pointing device. The following actions result from this step:

The current focus entity is appended to the history (item 67 in FIG. 6) of the session along with the CTV.

A new CTV is computed and applied. If the CTV is defined in the current session state, then a new value for the CTV is computed with values equal to the temperature sensitivity vector or TSV for the entity designated as the new focus entity, as defined above. If the CTV is not defined in the current session state then a new value for the CTV is computed as the ATV of the newly chosen entity, measured with the thermometer (item 61 in FIG. 6) of the session.

The focus entity is set to the newly chosen entity.

The entity priority function (item 66 in FIG. 6) is computed and recorded at the time of the action, according to the procedure specified above.

The display is adjusted to match the changes in session state.

The CONTROL(e, c, a, v) navigation action applies action (a) to control (c) in entity (e) with an optional association to event (v). The user might effect this action by using input devices to manipulate table 81. The following steps result from this action:

If no event (v) is specified, and an event is defined for the session, then the event of the session is used as the value v in the steps below.

The control change is effected by communication to the system model and the control analyzer.

The analyzer is notified of the control action and its parameters, and of the session history (item 67 in FIG. 6).

The actions of the analyzer based on the notification of a control action from the interaction session cause changes to the relationships in the system model. In brief, the system is updated to show that the control (c) of entity (e) is a likely cause of changes in the metrics which were important when the session was started. An initial change in values is made with a moderate confidence level just because this action was chosen, and this is followed up later by increased strength and confidence if the monitor data confirms that the metrics exhibited a change after the action was taken.

In addition to the core navigation operations described above, the preferred embodiment of the invention provides the following functions for navigating through an interaction session.

REVERT: Return to a previous session state

REFRESH: Refresh the session state

EVENT (v): Select an event (v) with which the session is to be associated.

Referring again to FIG. 6, the REVERT action restores the values of the focus entity (item 64), CTV (item 65, if one is defined) and entity priority function (item 66) to the values most recently saved in the history 67. Those values are then removed from the history. The operator might effect this action by using a pointing device to select the last entity shown in the history table 82 of FIG. 8.

The REFRESH navigation action recomputes the CTV 65 (if one is defined) and the entity priority function 66 based on current monitoring data in the repository. The display is adjusted to the changed values.

The EVENT(v) navigation action allows the operator to choose an event associated with the session during the course of the navigation. The event identifier 62 of the session is set to that of the supplied event. As a result, the thermometer of the event is used for future updates.
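The session state changes caused by FESELECT and REVERT, as described above, can be sketched with a small class. This is an illustrative assumption about the data layout, not the disclosed implementation: the `Session` class, its field names, and the two callables (`measure_atv` standing in for the session thermometer, `compute_tsv` standing in for the TSV computation of step 74) are hypothetical, and recomputing the entity priority function and the display is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Vector = List[float]


@dataclass
class Session:
    # Hypothetical stand-in for the session thermometer: entity name -> ATV.
    measure_atv: Callable[[str], Vector]
    # Hypothetical stand-in for step 74: (new entity, current focus, CTV) -> TSV.
    compute_tsv: Callable[[str, str, Vector], Vector]
    focus: Optional[str] = None    # focus entity, or None if none is defined
    ctv: Optional[Vector] = None   # Context Temperature Vector, or None
    history: List[Tuple[Optional[str], Optional[Vector]]] = field(default_factory=list)

    def feselect(self, entity: str) -> None:
        """Forward step: save the current state, then focus on `entity`."""
        self.history.append((self.focus, self.ctv))
        if self.ctv is not None and self.focus is not None:
            # CTV defined: the new CTV is the TSV of the newly chosen entity.
            new_ctv = self.compute_tsv(entity, self.focus, self.ctv)
        else:
            # CTV not defined: the new CTV is the ATV of the newly chosen entity.
            new_ctv = self.measure_atv(entity)
        self.focus, self.ctv = entity, new_ctv

    def revert(self) -> None:
        """Restore the most recently saved state and remove it from the history."""
        if self.history:
            self.focus, self.ctv = self.history.pop()


# Hypothetical usage with the temperatures from the earlier worked example.
atvs = {"application": [40.0, 125.0, 65.0], "disk": [60.0, 150.0]}
session = Session(measure_atv=lambda e: atvs[e],
                  compute_tsv=lambda e, focus, ctv: [40.5, 72.9])  # stubbed TSV
session.feselect("application")  # first selection: CTV becomes the ATV
session.feselect("disk")         # later selection: CTV becomes the TSV
session.revert()                 # back to the application as focus entity
```

Storing the focus entity together with its CTV in each history entry is what lets REVERT restore a previous state without recomputation.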

FIG. 9 illustrates a process through which the operator might use the performance manager to diagnose a problem in the managed system. At step 90, the operator creates an interaction session. At step 91, the operator examines entity scores in the table 80 of FIG. 8 to see whether one of the displayed measured entities might be the cause of the problem. If a measured entity is the likely cause, the operator applies controls for that entity in step 92. If none of the displayed entities appears as the likely source of the problem, the operator uses the FESELECT(f) function to navigate to a new measured entity based on the scores. The FESELECT(f) function is repeated until a likely cause of the problem is identified and corrected.

While the present invention has been particularly shown and described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention. Accordingly, the disclosed invention is to be considered merely as illustrative and limited in scope only as specified in the appended claims.

1. A computer-implemented performance manager for use with a computer system, the computer system including a plurality of components each associated with a set of performance characteristics, the performance manager comprising: a system model representing the state of the computer system and including a plurality of measured entities and relationships among the plurality of measured entities, the plurality of measured entities representing performance characteristics of the plurality of components; a plurality of data producers for providing the system model with performance information about the plurality of components; an interaction model for determining a set of most relevant entities affecting the computer system performance; and an analyzer for analyzing the performance information and detecting relationships among the plurality of measured entities, each relationship being represented as a data object in the system model and containing information about how metrics of an independent measured entity affect metrics of a dependent measured entity, wherein each of the plurality of measured entities corresponds to a set of metrics and controls, the controls capable of changing the performance characteristics of the components associated with each of the plurality of measured entities, the plurality of measured entities comprises at least one independent entity having a number of metrics and at least one dependent entity having a number of metrics, the system model further comprises a forward coupling strength matrix comprising a plurality of coupling strength elements arranged in a number of rows equal to the number of metrics for the independent entity and a number of columns equal to the number of metrics for the dependent entity, each coupling strength element has a value in a range from −1 to 1, a value of a coupling strength element is representative of a magnitude and a direction of a change in a first measured entity resulting from a change in a second measured entity, the system model further comprises a forward confidence matrix comprising a plurality of confidence elements arranged in a number of rows equal to the number of metrics for the independent entity and a number of columns equal to the number of metrics for the dependent entity, each confidence element has a value in a range from 0 to 1, a value of a confidence element is representative of a probability that a corresponding element in the forward coupling strength matrix affects a relationship between the independent entity and the dependent entity, and a matrix multiplication of the forward coupling strength matrix and the forward confidence matrix has a matrix result representative of the relationship between the independent entity and the dependent entity.
2. The performance manager as recited in claim 1 further comprising a visualizer for presenting the set of most relevant entities to a system operator and allowing the operator to apply changes to the component performance characteristics through the interaction model.
3. The performance manager as recited in claim 1, wherein the plurality of data producers communicate with the plurality of components in the computer system through a plurality of interface agents.
4. The performance manager as recited in claim 1 further comprising an engine between the plurality of data producers and the interaction model for exchanging data with the plurality of data producers and providing data to the system model.
5. The performance manager as recited in claim 1, wherein the system model and interaction model are maintained in a data repository.
6. The performance manager as recited in claim 5, wherein the data repository further includes an access interface to allow access to data in the repository.
7. The performance manager as recited in claim 1, wherein each relationship among the plurality of measured entities includes a plurality of dependencies each corresponding to a first metric in a first measured entity and a second metric in a second measured entity.
8. The performance manager as recited in claim 7, wherein each dependency of the plurality of dependencies is associated with a strength and a confidence, the strength indicating a degree of correlation between the respective first and second metrics, the confidence indicating a likelihood that the dependency exists.
9. The performance manager as recited in claim 1, wherein relationships are created and modified based on operator input.
10. The performance manager as recited in claim 1, wherein relationships are created and modified based on component performance information.
11. The performance manager as recited in claim 1, wherein the relationships among the measured entities are created and modified based on information provided by the plurality of data producers.
12. The performance manager as recited in claim 1, wherein the system model evaluates the plurality of measured entities based on a plurality of temperature scales and thermometers, each temperature scale denoting a context for analyzing the performance of the system, and each thermometer designating a performance measurement as compared to a threshold value.
13. A computer-program product for managing the performance of a computer system, the computer system including a plurality of components each associated with a set of performance characteristics, the computer-program product comprising: a computer-readable medium; means, provided on the computer-readable medium, for representing the state of the computer system as a system model, the system model including a plurality of measured entities and relationships among the plurality of measured entities, the plurality of measured entities representing performance characteristics of the plurality of components; means, provided on the computer-readable medium, for providing the system model with performance, configuration and diagnostic information about the components; means, provided on the computer-readable medium, for forming a forward coupling strength matrix comprising a plurality of coupling strength elements, wherein each of the plurality of coupling strength elements has a value in a range from −1 to 1 and the value of a coupling strength element is representative of a magnitude and a direction of a change in a first measured entity resulting from a change in a second measured entity; means, provided on the computer-readable medium, for forming a forward confidence matrix comprising a plurality of confidence elements, wherein each of the plurality of confidence elements has a value in a range from 0 to 1 and the value of a confidence element is representative of a probability that a corresponding element in the forward coupling strength matrix affects a relationship between a first selected measured entity and a second selected measured entity; means, provided on the computer-readable medium, for forming a matrix representation of a relationship between two selected measured entities, wherein the matrix representation is a result of a matrix multiplication between the forward coupling strength matrix and the forward confidence matrix; means, provided on the computer-readable medium, for determining a set of most relevant entities affecting computer system performance based on the system model; means, provided on the computer-readable medium, for analyzing the performance information and detecting relationships among the plurality of measured entities, each relationship being represented as a data object in the system model and containing information about how metrics of an independent measured entity affect metrics of a dependent measured entity, wherein each measured entity is associated with a plurality of metrics and controls, the controls affecting the operation of the components corresponding to each measured entity of the plurality of measured entities; and means, provided on the computer-readable medium, for changing the controls associated with the most relevant measured entities to improve the performance of the computer system.
14. The computer-program product as recited in claim 13 further comprising means, provided on the computer-readable medium, for presenting the set of most relevant entities to a system operator in an interaction session.
15. The computer-program product as recited in claim 13 further comprising means, provided on the computer-readable medium, for diagnosing a performance problem in the computer system using the performance manager.
16. The computer-program product as recited in claim 13 further comprising: means, provided on the computer-readable medium, for selecting one of the measured entities as a focus entity; and means, provided on the computer-readable medium, for computing an entity priority function based on the focus entity and the entity relationship.